Abstract: We employ two different statistical tests to examine whether, in the framework
of the Constrained MSSM, the experimentally determined values of $BR(B \to X_s\gamma)$ and
the anomalous magnetic moment of the muon, $(g-2)_\mu$, are consistent with each other. Our
tests are designed to compare the theoretical predictions of the CMSSM in data space with
the actual measurements, once all of the CMSSM free parameters have been integrated
out and constrained using all other available data. We investigate the value of $(g-2)_\mu$ as
obtained by using $e^+e^-$ data alone (which shows a $\sim 3\sigma$ discrepancy with the Standard
Model prediction) and as obtained based on $\tau$ decay data (which shows a much milder,
$1\sigma$ discrepancy). We find that one of our tests returns either a statistically inconclusive
result or shows weak evidence of tension between $BR(B \to X_s\gamma)$ and the $e^+e^-$-data based
value of $(g-2)_\mu$. On the other hand, our second test, which is more stringent in this
application, reveals that the joint observations of $BR(B \to X_s\gamma)$ and $(g-2)_\mu$ from $e^+e^-$
data alone are incompatible within the CMSSM at the $\sim 2\sigma$ level. On the other hand,
for both tests we find no significant tension between $BR(B \to X_s\gamma)$ and the value of
$(g-2)_\mu$ evaluated using $\tau$ decay data. These results are only weakly dependent on the
three different priors that we employ in the analysis. We conclude that, if the discrepancy
between the Standard Model and the experimental determinations of $(g-2)_\mu$ is confirmed
at the $\sim 3\sigma$ level, this could be interpreted as strong evidence against the CMSSM.
1. Introduction

Softly broken low-energy supersymmetry (SUSY) is considered to be perhaps the most
promising theory beyond the Standard Model (SM). Not only does it provide an elegant
solution to the hierarchy problem [?], but it also naturally accommodates gauge coupling
unification [?] and offers a clue to the dark matter (DM) problem in the Universe [?].

On the other hand, without specifying a complete underlying mechanism of SUSY
breaking, the general Minimal Supersymmetric Standard Model (MSSM) suffers from a
large number of SUSY-breaking soft parameters which are poorly determined. Motivated
by a natural link between SUSY and grand unified theories (GUTs), over the last several
years it has become customary to impose various boundary conditions at the GUT scale and
explore the resulting SUSY phenomenology. The most popular model of this class is the
Constrained MSSM (CMSSM) [?], which includes the minimal supergravity model (mSUGRA) [?].
In this scheme one defines all SUSY parameters at the unification scale $M_{\rm GUT}$ and then
employs the Renormalization Group Equations (RGEs) to evolve them down and compute the
couplings and masses in an effective theory valid at the electroweak scale. The CMSSM is
defined in terms of four continuous free parameters: the common scalar ($m_0$), gaugino ($m_{1/2}$)
and tri-linear ($A_0$) mass parameters (all specified at the GUT scale), plus the ratio of
Higgs vacuum expectation values $\tan\beta$; and one discrete parameter ${\rm sgn}(\mu)$, where $\mu$ is the
Higgs/higgsino mass parameter whose square is computed from the conditions of radiative
electroweak symmetry breaking (EWSB).

The phenomenology of the CMSSM has been studied in a vast number of papers.
The usual approach has been to explore the model by performing fixed grid scans in $m_{1/2}$
and $m_0$ for fixed, "representative" values of $\tan\beta$ and of $A_0$ [?], and also for fixed
values of SM parameters, e.g., the top mass $m_t$, which can have a large impact on results,
especially at large $m_0$. Also, the model's predictions for observable quantities, e.g., the relic
abundance $\Omega_\chi h^2$ of the neutralino, or Higgs and superpartner masses, have been compared
with experimental data in a simplified way: if the predicted value is within some arbitrary
range, typically $1\sigma$ or 90% CL, then the point is treated as "allowed"; otherwise it is
rejected. Theoretical errors are also typically ignored. A $\chi^2$ approach applied in [?], [?], [9]
addressed the latter problems. On the other hand, the procedure advocated in those papers
to reduce the effective number of CMSSM parameters, namely using the well-measured value of
$\Omega_\chi h^2$ to determine a "surface" in the model's parameter space, was somewhat questionable,
as the shape and "thickness" of that surface can critically depend on the actual value of $m_t$,
especially at large $m_0$ (compare fig. 4 in [?]). The same applies to the inclusion of a "fudge
factor" in the definition of $\chi^2$ in order to suppress the contribution of the large $m_0$ region
(see eq. (1) in [?]). In a more recent analysis [11] that element of the analysis has been
abandoned. On the other hand, the conclusions of [11] heavily rely on the somewhat uncertain
discrepancy between the SM and the experimental determinations of the anomalous magnetic
moment of the muon $(g-2)_\mu$.

Over the last few years a new approach based on Bayesian statistics linked with either
Markov Chain Monte Carlo (MCMC) [12], [14], [?], [?], [18], [19], [?] or
Nested Sampling (NS) scanning methods [?] has been successfully applied in a well
defined statistical framework (see, e.g., [?]). Furthermore the priors issue, considered as a
"soft spot" of the Bayesian approach, has been recently thoroughly addressed and has been
shown to embody in a quantitative manner the physical fine-tuning of the theory, see [27].

One of the outcomes of the most recent and more sophisticated scans has been to
realize that even the CMSSM, with its relative economy of free parameters, remains
presently somewhat underconstrained by currently available data, thus leaving large
regions of CMSSM parameters allowed [?]. Despite this, it was pointed out in [?], and
investigated in more detail in [?], that there appears to exist a certain "tension" between
the current measurements of $BR(B \to X_s\gamma)$ (hereafter denoted by $b \to s\gamma$ for brevity)
and $(g-2)_\mu$, in the sense that the two observables favor different regions of the CMSSM
parameter space (compare figs. 8 and 10 in [?]). This is because the $BR(B \to X_s\gamma)$
constraint favors the focus point (FP) region [28], [29], as the (always positive) charged
Higgs/top contribution has to be large enough so that, starting from the SM central value
of $3.12 \times 10^{-4}$, the (negative, for $\mu > 0$) chargino/stop contribution can bring the sum down
to the experimental central value of $3.55 \times 10^{-4}$. This requires the charged Higgs to be
light enough and the stop (or chargino, or both) to be heavy enough. Both conditions are
satisfied in the FP region. On the other hand, large corrections to the $(g-2)_\mu$ value arise
mostly in the low-mass region. We feel that it would be interesting to further investigate
this tension, and to develop statistical tools to quantify the possible incompatibility of the
two observations within the theoretical model. A strong tension between the data would
then be interpreted either as a sign of an undetected (or underestimated) systematic error
in one of the data sets, or as a sign that the theoretical model is at odds with the data and
hence is disfavored. A first evaluation of the tension has been carried out in [24] using a
model comparison test, which however returned an inconclusive result. The purpose of this
paper is to reconsider the problem of the tension between these two observables by addressing
it with a novel statistical test, called "the predictive likelihood ratio test", or the
$\mathcal{L}$-test, introduced below. We clarify that the reason why the test based on model comparison
performed in [24] is inconclusive can be traced back to the orientation of degeneracies in
data space, a feature that in this particular context makes the model comparison test less
stringent than the new test introduced here.

The paper is organized as follows. In section 2 we introduce the statistical framework
and define the new test for the compatibility of observables based on the predictive data
distribution (the $\mathcal{L}$-test) as well as present the test based on the model comparison (the
$\mathcal{R}$-test). In sec. 3 we specify the theoretical model and its parameters and the priors we
consider and we apply the statistical tests to the CMSSM. We present our numerical results
in section 4. Our conclusions are given in section 5.

2. Setup 

2.1 Statistical framework 

We follow here the notation and conventions of our previous works [15], [17], [25]. We denote
the set of parameters of the model $\mathcal{M}$ under consideration (here the CMSSM) by $\theta$, and
by $\psi$ all other relevant parameters, the so-called nuisance parameters, which here include
relevant SM quantities. Both sets form our basis parameters

$$m = (\theta, \psi). \tag{2.1}$$

Bayesian inference is based on Bayes' theorem, which reads

$$p(m|d,\mathcal{M}) = \frac{p(d|m,\mathcal{M})\, p(m|\mathcal{M})}{p(d|\mathcal{M})}. \tag{2.2}$$

The quantity $p(m|d,\mathcal{M})$ on the l.h.s. of eq. (2.2) is called a posterior probability density
function (posterior pdf, or simply a posterior). On the r.h.s., the quantity $p(d|m,\mathcal{M})$,
taken as a function of $m$ for fixed data $d$, is called the likelihood. The likelihood supplies
the information provided by the data. The quantity $p(m|\mathcal{M})$ denotes a prior probability
density function (prior pdf, or simply a prior) which encodes our state of knowledge about
the values of the parameters in $m$ before we see the data. The prior state of knowledge is
then updated to the posterior via the likelihood.
Finally, the quantity in the denominator is called the evidence or model likelihood,
which is obtained by computing the average of the likelihood under the prior (so that the
r.h.s. of eq. (2.2) is properly normalized to unit probability),

$$p(d|\mathcal{M}) = \int p(d|m,\mathcal{M})\, p(m|\mathcal{M})\, dm. \tag{2.3}$$



If one is interested in constraining the model's parameters, the evidence is merely a
normalization constant, independent of $m$, and can therefore be dropped. However, the evidence
is very useful in the context of Bayesian model comparison (see e.g. [30, 31] and [16, 24]
for recent applications to the CMSSM). One of the main goals of this paper is to develop
and apply to the CMSSM a new evidence-based statistical test on the consistency of two
or more observables within a given theoretical model. Taking into account that one might
wish to consider different models, all of the above relations have been conditioned explicitly
on the model under consideration, $\mathcal{M}$. However, in the following we will drop the explicit
conditioning on $\mathcal{M}$ since we only work in the framework of a single, given model in this
paper, namely the CMSSM.
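As a purely illustrative aside, eqs. (2.2) and (2.3) can be sketched numerically for a one-parameter toy problem. The Gaussian likelihood, the flat prior range and all function names below are assumptions made for this sketch; they are unrelated to the CMSSM likelihood used in the paper.

```python
import math

def gaussian(x, mean, sigma):
    """Normal density, used here as a toy likelihood p(d|m)."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def evidence(d, sigma, lo, hi, n=20000):
    """Midpoint-rule estimate of eq. (2.3) for a flat prior on [lo, hi] (toy assumption)."""
    h = (hi - lo) / n
    prior = 1.0 / (hi - lo)
    total = 0.0
    for i in range(n):
        m = lo + (i + 0.5) * h
        total += gaussian(d, m, sigma) * prior * h
    return total

def posterior(m, d, sigma, lo, hi):
    """Eq. (2.2): likelihood times prior, divided by the evidence."""
    if not lo <= m <= hi:
        return 0.0
    return gaussian(d, m, sigma) / (hi - lo) / evidence(d, sigma, lo, hi)

# Hypothetical datum d = 3 with unit error, flat prior on 0 <= m <= 10.
z_narrow = evidence(3.0, 1.0, 0.0, 10.0)
# Widening the prior range lowers the prior-averaged likelihood.
z_wide = evidence(3.0, 1.0, 0.0, 100.0)
print(z_narrow, z_wide)
```

The comparison of the two prior ranges shows why the evidence, unlike the posterior shape, depends on the prior volume: it is an average of the likelihood over the prior.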

2.2 The predictive likelihood ratio test ($\mathcal{L}$ test)

Let us split the full data set of $n$ observables $d$ (which will be given below) as $d = \{\mathscr{D}, D\}$.
Suppose that we are interested in testing the compatibility of the observations within
a subset $\mathscr{D} = \{\mathscr{D}_1, \ldots, \mathscr{D}_k\}$, $k < n$, conditional on the observed values for the second
part of the data set, $D = \{d_{k+1}, \ldots, d_n\}$, which are considered as external (independent)
constraints which are assumed to be correct. We are thus interested in evaluating the
conditional evidence $p(\mathscr{D}|D)$, which represents the probability of measuring data $\mathscr{D}$ given
that data $D$ have been gathered for the remaining $n - k$ observables. In other words, this
conditional probability can be interpreted as the predictive probability for a measurement of
the observables $\mathscr{D}$ given what has been observed for the other quantities. As a consequence
of the basic probability manipulation rules, the conditional evidence can be written as

$$p(\mathscr{D}|D) = \frac{p(\mathscr{D}, D)}{p(D)} \tag{2.4}$$

(recall that we are dropping the conditioning on the model $\mathcal{M}$, which is understood). On the
r.h.s., the joint evidence $p(\mathscr{D}, D)$ is the probability of measuring the joint data set within
the assumed model, independently of the actual true values of the model's parameters $m$,
which have been integrated out in the computation of the evidence, see eq. (2.3). The joint
evidence has to be evaluated as a function of the possible outcomes of the observations
of the data set $\mathscr{D}$. This requires evaluating the evidence for a series of possible values
for $\mathscr{D}$, each time integrating over the full parameter space of the model. The possible
data realizations $\mathscr{D}$ are different outcomes for the measurements (e.g., different means)
given the experimental noise, i.e., the reported error of the central value. At the same
time, the data set $D$ is held fixed at its actual observed values, for, as stated above,
we assume that this part of the data set is trustworthy and can be used to constrain the
model's parameters. Notice that while the central values of the data set $D$ are assumed
to be correct, the uncertainty on their value is automatically fully accounted for, since we
integrate over all the model's parameters when computing the evidence and we include
both experimental and theoretical errors on $D$.

Once $p(\mathscr{D}|D)$ is obtained as a function of $\mathscr{D}$, its evaluation at the observed value,
$\mathscr{D}^{\rm obs}$, allows one to determine the compatibility of the observed data realization
with the model and the rest of the data, by evaluating the relative probability of obtaining
such a realization compared to the maximum probability for the data set in question. Let
us denote by $\mathscr{D}^{\max}$ the values of the data that maximise $p(\mathscr{D}|D)$. Then the relevant
quantity to consider is the ratio

$$\mathcal{L}(\mathscr{D}^{\rm obs}|D) \equiv \frac{p(\mathscr{D}^{\rm obs}|D)}{p(\mathscr{D}^{\max}|D)} = \frac{p(\mathscr{D}^{\rm obs}, D)}{p(\mathscr{D}^{\max}, D)}, \tag{2.5}$$

where we have used eq. (2.4) in the second equality. This is analogous to a likelihood ratio
in data space, but integrated over all possible values of the parameters of the model. We call
the $\mathcal{L}$-test the predictive likelihood ratio test. If $\mathcal{L}(\mathscr{D}^{\rm obs}|D) \sim 1$, this shows that both data
sets are compatible with each other and with the model's assumptions (including the prior
choice), and therefore we can legitimately use them together to constrain the parameters
of the model $\mathcal{M}$. If however $\mathcal{L}(\mathscr{D}^{\rm obs}|D) \ll 1$, we should doubt the consistency of the data
$\mathscr{D}$ (perhaps considering the possibility of systematic effects) or the model's assumptions
(i.e., the choice of model or of the assumed form and/or ranges of its priors). If the $\mathcal{L}$-test
comes out to be weakly dependent on the prior, then this will give us more confidence that
the conclusions of the statistical test when applied to the assumed model are robust.

A simple example of the application of the $\mathcal{L}$-test method to a toy linear model is
presented in the Appendix.
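To make the predictive likelihood ratio concrete, the following minimal sketch works it out for a one-parameter Gaussian problem in which every integral is analytic. This is a hypothetical illustration, not the toy linear model of the paper's Appendix; the prior, the measurement values and the function names are all assumptions of the sketch.

```python
import math

def update_gaussian(prior_mean, prior_sd, datum, noise_sd):
    """Posterior mean and sd of theta after one Gaussian measurement of theta (the data D)."""
    precision = 1.0 / prior_sd ** 2 + 1.0 / noise_sd ** 2
    mean = (prior_mean / prior_sd ** 2 + datum / noise_sd ** 2) / precision
    return mean, math.sqrt(1.0 / precision)

def ln_predictive_ratio(tested_obs, tested_noise_sd, post_mean, post_sd):
    """ln of the ratio p(tested_obs | D) / p(tested_max | D).

    For this toy the predictive distribution is Gaussian with mean post_mean and
    variance post_sd^2 + tested_noise_sd^2, so its maximum sits at post_mean and
    the normalisation cancels in the ratio (the noise is held fixed).
    """
    var = post_sd ** 2 + tested_noise_sd ** 2
    return -0.5 * (tested_obs - post_mean) ** 2 / var

# Assumed numbers: broad prior N(0, 10^2); D measures theta = 1.0 +/- 1.0;
# the tested observation reports 4.0 +/- 1.0.
post_mean, post_sd = update_gaussian(0.0, 10.0, 1.0, 1.0)
ln_ratio = ln_predictive_ratio(4.0, 1.0, post_mean, post_sd)
print(post_mean, post_sd, ln_ratio)
```

In this toy the ratio equals 1 (ln ratio 0) when the tested observation coincides with the value preferred by the remaining data, and decreases as the two move apart relative to the combined uncertainty.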

2.3 The model comparison test ($\mathcal{R}$ test)

A different compatibility test has been employed by [24], following earlier applications
in cosmology [33]. The gist of what was called "model comparison test" there can be
summarized as follows (see [24] for full details).

The idea is to perform a Bayesian model comparison test between two hypotheses,
namely $\mathcal{H}_0$, stating that the data $\mathscr{D}$ under scrutiny are all compatible with each other
and with the model, versus $\mathcal{H}_1$, purporting that the observables are incompatible (within
the assumed model) and hence tend to pull the constraints in different regions of parameter
space. For $k > 1$, the Bayes factor between the two hypotheses, giving the relative
probabilities (odds) between $\mathcal{H}_0$ and $\mathcal{H}_1$, is given by

$$\mathcal{R}(\mathscr{D}) = \frac{p(\mathscr{D}|D, \mathcal{H}_0)}{\prod_{i=1}^{k} p(\mathscr{D}_i|D, \mathcal{H}_1)}. \tag{2.6}$$

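Writing again the conditional evidences in terms of the joint evidences, e.g. $p(\mathscr{D}|D,\mathcal{H}_0) =
p(\mathscr{D}, D|\mathcal{H}_0) / p(D|\mathcal{H}_0)$, and noting that $p(D|\mathcal{H}_0) = p(D|\mathcal{H}_1)$ (which follows because the
evidence from the data we are not testing does not depend on the hypothesis being considered),
eq. (2.6) can be recast as

$$\mathcal{R}(\mathscr{D}) = \frac{p(\mathscr{D}, D|\mathcal{H}_0)}{\prod_{i=1}^{k} p(\mathscr{D}_i, D|\mathcal{H}_1)}\, p(D|\mathcal{H}_0)^{k-1} \qquad (\mathcal{R}\text{-test},\ k > 1). \tag{2.7}$$

If instead $k = 1$, i.e., $\mathscr{D} = \mathscr{D}_1$, and we wish to test the consistency of one single new
observation, then eq. (2.7) needs to be modified to

$$\mathcal{R}(\mathscr{D}_1) = \frac{p(\mathscr{D}_1, D|\mathcal{H}_0)}{p(\mathscr{D}_1|\mathcal{H}_1)\, p(D|\mathcal{H}_1)} \qquad (\mathcal{R}\text{-test},\ k = 1). \tag{2.8}$$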



Eqs. (2.7) and (2.8) are then evaluated at the observed value of the data sets being tested,
i.e. for $\mathscr{D} = \mathscr{D}^{\rm obs}$. If $\ln\mathcal{R}(\mathscr{D}^{\rm obs}) > 0$, this is evidence in favour of the hypothesis $\mathcal{H}_0$
that the data are compatible. If instead $\ln\mathcal{R}(\mathscr{D}^{\rm obs}) < 0$, the alternative hypothesis $\mathcal{H}_1$
that there is a tension among the data (and the model) is preferred. More quantitatively,
the strength of evidence for either case can be assessed against the so-called "Jeffreys' scale",
which we report in table 1 along with our (slightly modified) convention for denoting the
different levels of evidence.



$|\ln\mathcal{R}|$    Odds         Strength of evidence
< 1.0      < 3 : 1      Inconclusive
1.0        ~ 3 : 1      Weak evidence
2.5        ~ 12 : 1     Moderate evidence
5.0        ~ 150 : 1    Strong evidence

Table 1: Empirical scale for evaluating the strength of evidence (so-called "Jeffreys' scale").
Threshold values are empirically set, and they occur for values of the logarithm of the Bayes factor
between the hypotheses of $|\ln\mathcal{R}| = 1.0$, 2.5 and 5.0. The right-most column gives our convention
for denoting the different levels of evidence above these thresholds, according to the prescription
in [?].
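The thresholds of Table 1 can be applied mechanically, as in the short sketch below. The function name and the returned labels for the favoured hypothesis are illustrative choices for this sketch; the thresholds and strength labels are those of the table, and the odds are simply the exponential of the absolute log Bayes factor.

```python
import math

# (threshold on |ln R|, label), ordered from strongest to weakest, as in Table 1.
THRESHOLDS = [
    (5.0, "Strong evidence"),
    (2.5, "Moderate evidence"),
    (1.0, "Weak evidence"),
]

def jeffreys_label(ln_r):
    """Return (strength, favoured hypothesis, odds) for a given ln R."""
    strength = "Inconclusive"
    for threshold, label in THRESHOLDS:
        if abs(ln_r) >= threshold:
            strength = label
            break
    if ln_r > 0:
        favoured = "H0 (data compatible)"
    elif ln_r < 0:
        favoured = "H1 (tension)"
    else:
        favoured = "neither"
    return strength, favoured, math.exp(abs(ln_r))

for value in (0.4, -1.0, 2.5, -5.0):
    print(value, jeffreys_label(value))
```

The odds quoted in Table 1 are rounded: the thresholds 1.0, 2.5 and 5.0 correspond to odds of roughly 2.7, 12 and 148 to 1.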



In applying the test to the CMSSM below we will consider the cases $k = 1$ and $k = 2$
with, as mentioned above, the two pieces of data being tested for mutual consistency being
$b \to s\gamma$ and $\delta(g-2)_\mu$ (the latter from $\tau$ decay and $e^+e^-$ data separately).



3. An application to the CMSSM 

Before we apply the above formalism to the CMSSM, we first specify the priors tested and 
the experimental constraints. 

3.1 Choice of priors and data 

In order to assess the robustness of our results with respect to plausible changes of priors, 
we consider three different classes of priors: 

• flat prior: flat on $m_0$, $m_{1/2}$, $A_0$, $\tan\beta$, with ranges as given in section 3.2 of [25];

• log prior: flat on $\ln m_0$, $\ln m_{1/2}$, $A_0$, $\tan\beta$, with ranges as given in section 3.2 of [25];

• CCR mSUGRA prior: flat on $m_0$, $m_{1/2}$, $A_0$, $B$ but with an effective "penalty term"
that naturally leads to low fine-tuning among SUSY parameters.
Unlike the first two priors, which refer to the CMSSM parameterization in terms
of its parameters $\theta = (m_{1/2}, m_0, A_0, \tan\beta)$, the third prior, as introduced in ref. [27] by
Cabrera, Casas and Ruiz de Austri (hence the name), is applied to mSUGRA with its basic
parameters $m_0$, $m_{1/2}$, $A_0$, $B$, augmented by the top Yukawa coupling $y_t$. A marginalization
over $\mu$ selects the value $\mu_0$ that reproduces the experimental value of $M_Z$. As shown in
ref. [27], it is also convenient and natural to trade the parameter $B$ for $\tan\beta$ and $y_t$ for the
top mass $m_t \propto y_t \sin\beta$. This procedure results in an effective prior

$$p_{\rm eff}(m_t, m_0, m_{1/2}, A_0, \tan\beta) = J|_{\mu=\mu_0}\; p(y_t, m_0, m_{1/2}, A_0, B, \mu = \mu_0). \tag{3.1}$$



Assuming a flat prior on the parameters $m_0$, $m_{1/2}$, $A_0$, $B$ and a log prior on $y_t$, the Jacobian
term acts as an effective "penalty term" that favors lower values of $\mu$ and $\tan\beta$, and thus
leads to less fine-tuning, as in the focus point region. In the CMSSM this corresponds to
large $m_0$. On the other hand, the change of parameters from $B$ to $\tan\beta$ favors large
$m_{1/2}$ because of the $B$ dependence on $m_{1/2}$ in the RGEs. (See ref. [27] for more details.)



As we shall see, our results are largely insensitive to the choice of priors, which indicates
a remarkable robustness of this statistical test. This can be traced back to the fact
that the parameters within the model are fully integrated out in the computation of the
predictive probability.



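Observable                                  Mean value    Uncertainties               ref.
                                            $\mu$         $\sigma$ (exper.)  $\tau$ (theor.)
$\delta a_\mu^{\rm SUSY} \times 10^{10}$    29.5          8.8          1.0           [?] ($e^+e^-$ data)
$\delta a_\mu^{\rm SUSY} \times 10^{10}$    8.9           9.5          1.0           [?] ($\tau$ data)
$BR(B \to X_s\gamma) \times 10^{4}$         3.55          0.26         0.21          [?]

Table 2: Summary of the observables $\mathscr{D}$ being tested for consistency.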

The focus of this paper is to test for consistency the measured values of $b \to s\gamma$ and
the anomalous magnetic moment of the muon, $(g-2)_\mu$. For the latter, we consider two
sets of measurements: the first is based on $e^+e^-$ data, and it gives a $\sim 3.2\sigma$ discrepancy
with the SM predicted value [?]; the second one employs $\tau$ decay data to evaluate the
SM hadronic contribution to $(g-2)_\mu$ instead, which leads to a much better agreement,
$\delta a_\mu^{\rm SUSY} = (8.9 \pm 9.5) \times 10^{-10}$ [?]. These values and their uncertainties are listed in the
top part of Table 2.

As regards $BR(B \to X_s\gamma)$, for the new SM prediction we obtain the value of
$(3.12 \pm 0.21) \times 10^{-4}$.¹ We compute the SUSY contribution to $BR(B \to X_s\gamma)$ following the procedure
outlined in refs. [?], [?], which was extended in refs. [?], [?] to the case of general flavor
mixing. In addition to full leading order corrections, we include large $\tan\beta$-enhanced
terms arising from corrections coming from beyond the leading order and further include
(subdominant) electroweak corrections.

¹The value of $(3.15 \pm 0.23) \times 10^{-4}$ originally derived in ref. [?] was obtained for slightly different
values of $M_t$ and $\alpha_s(M_Z)^{\overline{MS}}$. Note that, in treating the error bar we have explicitly taken into account
the dependence on $M_t$ and $\alpha_s(M_Z)^{\overline{MS}}$, which in our approach are treated parametrically. This has led to
a slight reduction of its value.
Observable                                  Mean value       Uncertainties
                                                             $\sigma$ (exper.)    $\tau$ (theor.)

Nuisance parameters
$M_t$                                       172.6 GeV        1.4 GeV              N/A
$m_b(m_b)^{\overline{MS}}$                  4.20 GeV         0.07 GeV             N/A
$\alpha_s(M_Z)^{\overline{MS}}$             0.1176           0.002                N/A
$1/\alpha_{\rm em}(M_Z)^{\overline{MS}}$    127.955          0.03                 N/A

Observables (measured)
$M_W$                                       80.398 GeV       25 MeV               15 MeV
$\sin^2\theta_{\rm eff}$                    0.23153          $16 \times 10^{-5}$  $15 \times 10^{-5}$
$\Delta M_{B_s}$                            17.77 ps$^{-1}$  0.12 ps$^{-1}$       2.4 ps$^{-1}$
$BR(B_u \to \tau\nu) \times 10^{4}$         1.32             0.49                 0.38
$\Omega_\chi h^2$                           0.1099           0.0062               $0.1\,\Omega_\chi h^2$

Observables (limits)                        Limit (95% CL)                        $\tau$ (theor.)
$BR(B_s \to \mu^+\mu^-)$                    $< 5.8 \times 10^{-8}$                14%
$m_h$                                       > 114.4 GeV (SM-like Higgs)           3 GeV
$\zeta_h^2$                                 $f(m_h)$ (see ref. [15])              negligible
$m_{\tilde q}$                              > 375 GeV
$m_{\tilde g}$                              > 289 GeV
other sparticle masses                      As in table 4 of ref. [15]

Table 3: Summary of the observables $D$ used in the analysis, on which the consistency test is
conditional. Upper part: measurements on nuisance (SM) parameters. (N/A stands for "not
applicable".) Central part: Observables for which a positive measurement has been made. Lower
part: Observables for which only limits currently exist. For details, see the treatment in ref. [15].



All the other experimental values of the collider and cosmological observables that we
assume in order to perform the compatibility test for $\delta a_\mu^{\rm SUSY}$ and $BR(B \to X_s\gamma)$ are listed
in table 3. We refer to [15], [?] for details about the computation of each quantity
and for justification of the theoretical errors adopted, as well as for a detailed description of
the likelihood function. In particular, points that do not fulfil the conditions of radiative
EWSB and/or give non-physical (tachyonic) solutions are discarded. Also, we take $\mu > 0$,
because of its correlation with the sign of $\delta(g-2)_\mu$.

3.2 Applying the $\mathcal{L}$ test to the CMSSM

We are interested in assessing the compatibility of $\mathscr{D} = \{b \to s\gamma, \delta(g-2)_\mu\}$, while assuming
all the other data (denoted by $D$) to be believable. We remind the reader at this point
that we are concerned with making predictions in data space, and not in parameter space,
as is usually done. We are not interested in constraining the parameters of the model
here, but instead integrate over all their possible values. Therefore, the resulting values of
$b \to s\gamma$ and $\delta(g-2)_\mu$ should be understood to represent the mean values that are predicted
to be obtained experimentally for the respective quantities within the CMSSM, once all
the other constraints on the r.h.s. of the conditioning bar ($D$, as in table 3) are taken into
account (including their experimental and theoretical uncertainties).

We thus evaluate the evidence and compute the predictive probability on a grid of
values for $(b \to s\gamma, \delta(g-2)_\mu)$ representing the possible outcomes for the central value of
the observation. At each point we keep the same experimental error as the one that has
been effectively reported by the experiments (adding the theoretical error on top), as given
in table 2. In other words, we consider different possible outcomes for the central values
but with fixed instrumental noise properties, which is a reasonable assumption.

The computation of the evidence is numerically costly, as it involves an 8-dimensional
integral over the whole parameter space for every choice of data values one wishes to test.
We employ a modified version of the SuperBayeS package [17] including the MultiNest
algorithm [?], which allows one to compute the evidence and, from there, the predictive
probabilities involved in the $\mathcal{L}$-test. Despite MultiNest's high efficiency, each evidence
evaluation still requires about 3 days of CPU time on four 3.00 GHz Intel Woodcrest
processors. The appendix of ref. [25] provides a full description of how the uncertainty on the value
of the evidence is evaluated with MultiNest. This uncertainty is then propagated to the
uncertainty on the $\mathcal{L}$-test of eq. (2.5).

We scan over the following central values for the experimental outcomes, chosen to
bracket the actually observed values:

$$BR(B \to X_s\gamma) \times 10^{4}: \quad 1.5, \ldots, 4.0 \ \text{in intervals of } 0.5 \tag{3.2}$$
$$\delta(g-2)_\mu \times 10^{10}: \quad 0, \ldots, 40 \ \text{in intervals of } 5. \tag{3.3}$$

As for the experimental noise, we fix this to the actually reported value for the real
observation, supplemented by a suitable theoretical error, as given in table 2. When considering
the two different experimental determinations of $\delta(g-2)_\mu$ (one based on $\tau$ decay data and
one based on $e^+e^-$ data alone), we should in principle repeat our test using the reported
experimental error for each of the observations. However, the reported experimental errors
on $\delta(g-2)_\mu$ for the two determinations of the quantity are very similar (within about 10%)
and therefore we employ the uncertainty reported using the $e^+e^-$ data for both. This
approximation is not expected to influence our result significantly.
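The scan over central values in eqs. (3.2) and (3.3) can be laid out as in the sketch below. Only the grid itself is taken from the text; `toy_ln_evidence` is a made-up stand-in for an expensive evidence run (it is not a SuperBayeS or MultiNest call and does not reproduce any CMSSM number), and normalising to the largest value found on the grid, rather than to an interpolated maximum, is a simplifying assumption of the sketch.

```python
def central_value_grid():
    """Hypothetical central values following eqs. (3.2) and (3.3)."""
    br_values = [1.5 + 0.5 * i for i in range(6)]   # BR(B -> X_s gamma) x 1e4: 1.5 ... 4.0
    dg2_values = [5.0 * j for j in range(9)]         # delta(g-2)_mu x 1e10: 0 ... 40
    return [(br, dg2) for br in br_values for dg2 in dg2_values]

def ln_predictive_ratio_on_grid(grid, ln_evidence):
    """ln of the predictive likelihood ratio at each grid point.

    By eq. (2.5) the factor p(D) cancels, so the ratio of joint evidences suffices.
    Assumption: the maximum is taken over the grid points only.
    """
    ln_z = {point: ln_evidence(point) for point in grid}
    ln_z_max = max(ln_z.values())
    return {point: value - ln_z_max for point, value in ln_z.items()}

def toy_ln_evidence(point):
    """Illustrative placeholder for one evidence evaluation; NOT a CMSSM result."""
    br, dg2 = point
    return -0.5 * ((br - 3.0) / 0.5) ** 2 - 0.5 * (dg2 / 10.0) ** 2

grid = central_value_grid()
ln_ratio = ln_predictive_ratio_on_grid(grid, toy_ln_evidence)
print(len(grid), "evidence evaluations per choice of prior")
```

With 6 values of one observable and 9 of the other, the joint scan needs 54 evidence evaluations for each choice of prior, which is why the per-evaluation cost quoted above matters.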



4. Numerical results 



It is interesting to consider both the $\mathcal{L}$-test and the $\mathcal{R}$-test, for each of them is sensitive
to possible tensions between the observables in a different way and may in general give
different results. (This is demonstrated in a toy model example in the Appendix.) This is
in fact not surprising: there is only one way in which different measurements can be
compatible with each other and with the model being fitted (namely, all the measurements
are correct and the theory is the right model), whereas there are many different ways in which an
incompatibility could manifest itself. The $\mathcal{L}$-test asks what is the probability of measuring
a certain value for the data subset $\mathscr{D}$ (relative to the maximum probability achievable under
the model) given what is known about the model from the remaining data $D$. The $\mathcal{R}$-test
instead tries to enforce consistency between the data being tested and the remaining data
sets, by looking for values of $\mathscr{D}$ that are jointly compatible with the parameter space singled
out by $D$. These two approaches show subtle differences and in general play out differently
whenever a genuine tension between the observables exists.

Furthermore, the two tests are evaluated on different scales: the $\mathcal{L}$-test, being of the
form of a likelihood ratio test, can be evaluated on a significance scale analogous to the
usual $\Delta\chi^2$ rule, while the $\mathcal{R}$-test (representing odds between two hypotheses) should be
assessed against Jeffreys' scale for the strength of evidence. Another issue to consider is
that in general the two tests favour different degenerate regions of data space (see the
Appendix for an illustration) and one can easily imagine situations where one of these
regions is more constrained than the other, due to the structure of the model. In this
case, the test that exhibits more power along this more constrained degenerate region will
appear to be more stringent.

4.1 Results for the $\mathcal{L}$ test

We begin by employing the $\mathcal{L}$-test to separately test the consistency between $\mathscr{D}_1 = b \to
s\gamma$ and the other observables $D$ (but excluding from the latter $\delta(g-2)_\mu$) and between
$\mathscr{D}_1 = \delta(g-2)_\mu$ and the other observables (but excluding from the latter $b \to s\gamma$). Notice
in particular that we do include the dark matter constraint in the assumed data $D$. The
outcome of these two tests is shown in fig. 1 and reported in table 4. In the left panel of
fig. 1, we plot the quantity $\mathcal{L}(BR(B \to X_s\gamma)|D)$ as a function of the possible outcome of
the experimental observation, with the actual observed central value indicated by a vertical,
solid line. In the right panel of fig. 1, we plot instead $\mathcal{L}(\delta(g-2)_\mu|D)$ as a function of
the possible measured values of $\delta(g-2)_\mu$, indicating by vertical lines the actual observed
values from $e^+e^-$ and $\tau$ data.

The CMSSM, once all the observations other than $\delta(g-2)_\mu$ are accounted for, tends
to predict a $b \to s\gamma$ value close to the SM prediction, $BR(B \to X_s\gamma) \sim 3.12 \times 10^{-4}$ (the
precise value depending on the actual values of SM input parameters, especially $m_t$ and
$\alpha_s(M_Z)^{\overline{MS}}$), with predominantly small negative corrections arising from chargino-stop loop
contributions. This is shown by the peak in the predictive distribution, which occurs at
around $BR(B \to X_s\gamma) \sim 3 \times 10^{-4}$. The experimental central value $3.55 \times 10^{-4}$ is within about
$1\sigma$ of the most likely value, thus it is not significantly in tension with the other observables
(see top part of table 4).

Turning next to the anomalous magnetic moment of the muon (right panel of fig. 1),
the predictive probability is largest for $\delta(g-2)_\mu \sim 0$, as might be expected from noting that
only a small fraction of the CMSSM parameter space gives rise to sizable SUSY corrections
to $\delta(g-2)_\mu$. The probability remains almost flat out to $\delta(g-2)_\mu \lesssim 10 \times 10^{-10}$, which means
that the $\tau$ decay data determination is perfectly compatible with all other observations.
Indeed, the $\mathcal{L}$-test results in the bottom part of table 4 show that the results are not significant
for the $\tau$ decay data. However, the predictive probability drops fairly steeply above that
value (compare fig. 1), thus leading to tension for the $e^+e^-$ data, at about the $2\sigma$ level for
both the flat and the CCR mSUGRA prior. The significance is reduced to $\sim 1.6\sigma$ under
the log prior.

Figure 1: Predictive data distribution ($\mathcal{L}$-test) for $b \to s\gamma$ (left panel) and $\delta(g-2)_\mu$ (right panel)
in the CMSSM for three different choices of priors: flat prior (red/solid), log prior (blue/dotted)
and the CCR mSUGRA prior (green/dashed). The predictive distributions are conditional on all
other observations, excluding $\delta(g-2)_\mu$ and $b \to s\gamma$. The vertical lines give the actual measured
values. The errorbars denote the location at which the predictive probability has been computed
(and its error), while the lines are a smoothed spline.

Prior           $\ln\mathcal{L}(\mathscr{D}_1|D)$     Interpretation

$\mathscr{D}_1 = BR(B \to X_s\gamma)$
Flat            -1.63 ± 0.11       Not significant ($1.28\sigma$)
Log             -1.43 ± 0.12       Not significant ($1.20\sigma$)
CCR mSUGRA      -0.89 ± 0.13       Not significant ($< 1\sigma$)

$\mathscr{D}_1 = \delta(g-2)_\mu$ from $e^+e^-$ data
Flat            -3.99 ± 0.10       Incompatible at 95.4% significance
Log             -2.69 ± 0.10       Not significant ($1.64\sigma$)
CCR mSUGRA      -5.59 ± 0.10       Incompatible at 98.2% significance

$\mathscr{D}_1 = \delta(g-2)_\mu$ from $\tau$ decay data
Flat            -0.24 ± 0.10       Not significant ($< 1\sigma$)
Log             -0.38 ± 0.08       Not significant ($< 1\sigma$)
CCR mSUGRA      -0.30 ± 0.08       Not significant ($< 1\sigma$)

Table 4: Results of the $\mathcal{L}$-test, testing for consistency of $BR(B \to X_s\gamma)$ with all other data
(excluding $\delta(g-2)_\mu$) and for consistency of $\delta(g-2)_\mu$ with all other data (excluding $BR(B \to X_s\gamma)$).

It is interesting how the predictive probabilities are almost independent of the choice
of priors on the model's parameters, thus indicating a remarkable robustness in the model's
predictions.

Prior           $\ln\mathcal{L}(\delta(g-2)_\mu, b \to s\gamma|D)$     Interpretation

$\delta(g-2)_\mu$ from $e^+e^-$ data
Flat            -5.99 ± 0.13       Incompatible at 95.0% significance
Log             -5.87 ± 0.13       Incompatible at 94.7% significance
CCR mSUGRA      -6.42 ± 0.14       Incompatible at 96.0% significance

$\delta(g-2)_\mu$ from $\tau$ decay data
Flat            -1.59 ± 0.07       Not significant ($1.26\sigma$)
Log             -1.70 ± 0.07       Not significant ($1.30\sigma$)
CCR mSUGRA      -1.62 ± 0.07       Not significant ($1.27\sigma$)

Table 5: Results of the $\mathcal{L}$-test, jointly testing $b \to s\gamma$ and $\delta(g-2)_\mu$ for mutual compatibility and
compatibility with all other observations, $D$. We find that the $\delta(g-2)_\mu$ observation from $e^+e^-$
data is incompatible with $b \to s\gamma$ at the $\sim 95\%$ level, almost independently of the choice of prior.
On the other hand, no significant tension is detected for the $\delta(g-2)_\mu$ measurement from $\tau$ decay
data.



[Figure 2 panels: ℒ-test with flat prior (left), log prior (middle) and CCR mSUGRA prior (right). Horizontal axis: BR(B → X_s γ) × 10⁴.]

Figure 2: ℒ-test for both b → sγ and δ(g − 2)_μ, for flat priors (left panel), log priors (middle 
panel) and CCR mSUGRA priors (right panel) in the CMSSM. The crosses give the actual observed 
values (for the two different δ(g − 2)_μ determinations) and the green diamond is the most probable 
value under the model. Contours delimit values of ln ℒ(δ(g − 2)_μ, b → sγ|D) = -2.3, -6.17, -11.80, 
corresponding to joint 1, 2, 3σ significance regions. The small black dots indicate the locations at 
which the predictive probability has been evaluated, while the contours are interpolated. 



Prior          ln ℛ               Interpretation

δ(g − 2)_μ from e⁺e⁻ data
Flat           -0.62 ± 0.20       Inconclusive evidence
Log            -2.04 ± 0.20       Weak evidence for incompatibility
CCR mSUGRA     -0.69 ± 0.25       Inconclusive evidence

δ(g − 2)_μ from τ data
Flat            0.17 ± 0.19       Inconclusive evidence
Log            -0.64 ± 0.17       Inconclusive evidence
CCR mSUGRA     -0.58 ± 0.23       Inconclusive evidence

Table 6: Results of the ℛ-test, giving the relative odds (Bayes factor) between the two hypotheses 
that δ(g − 2)_μ and b → sγ are mutually compatible (corresponding to ln ℛ > 0) or that they are in 
tension with each other and/or the rest of the data, D (corresponding to ln ℛ < 0). The statistical 
interpretation is in accordance with Jeffreys' scale, given in table 0. 

We now consider the case where we test both b → sγ and δ(g − 2)_μ jointly, conditional 
on all other data. The result for the ℒ-test with 𝒟 = (b → sγ, δ(g − 2)_μ) is shown in 
fig. 2 and reported in table 5. We can see that the CMSSM, given the observations D, 
tends to prefer small corrections to δ(g − 2)_μ, although less so in the case of the log prior, 
which gives more weight to the low-mass region where SUSY corrections tend to be larger. 
The joint observation of b → sγ and the determination of δ(g − 2)_μ based on τ decay 
data lies within the 1σ region, and hence no significant tension is detected between these 
two datasets. However, the e⁺e⁻ data based determination of δ(g − 2)_μ shows a ~ 2σ 
significance for incompatibility (compare top part of table 5), which indicates an emerging 
tension between the two observations. The r.h.s. panel of fig. 2 gives the result for the 
CCR mSUGRA prior, which, as shown in ref. [27], prefers the focus point region and 
large gaugino masses, penalizing large tan β values. This implies that, under this prior, 
regions of parameter space are favoured where the decoupling of SUSY contributions to 
δ(g − 2)_μ occurs and where negative contributions of the chargino-stop loop to b → sγ are 
suppressed. (Notice how the 3σ significance region in this case is much more extended.) 
Despite this, the above results hold essentially unchanged even for this choice of prior. 
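
The joint significances of table 5 and the contour levels of fig. 2 can be cross-checked in a few lines. The sketch below assumes (again as our reading of the numbers rather than a stated definition) that the negative of the tabulated value acts as a Δχ² with two degrees of freedom, for which the enclosed probability is 1 − exp(−Δχ²/2). The function names are illustrative.

```python
import math

def joint_coverage_2dof(table_value):
    """Enclosed probability, ASSUMING -table_value acts as a 2-dof Delta chi^2."""
    return 1.0 - math.exp(table_value / 2.0)

def delta_chi2_2dof(n_sigma):
    """2-dof Delta chi^2 enclosing the same probability as an n-sigma Gaussian interval."""
    tail = math.erfc(n_sigma / math.sqrt(2.0))  # 1 minus the two-sided coverage
    return -2.0 * math.log(tail)

# Central values quoted in table 5 for the e+e- based determination.
table5_ee = {"Flat": -5.99, "Log": -5.87, "CCR mSUGRA": -6.42}

for prior, value in table5_ee.items():
    print(f"{prior:11s} {100.0 * joint_coverage_2dof(value):5.1f}%")

for n in (1, 2, 3):
    print(f"{n} sigma joint level: Delta chi^2 = {delta_chi2_2dof(n):.2f}")
```

Under this assumption the three e⁺e⁻ entries map onto 95.0%, 94.7% and 96.0%, matching the table, and the 1, 2, 3σ levels come out close to the contour values 2.3, 6.17 and 11.80 quoted in the caption of fig. 2.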

4.2 Results for the ℛ-test 

Turning now to the ℛ-test, we summarize the results in table 6 and plot the outcome in 
fig. 3. The ℛ-test for b → sγ and δ(g − 2)_μ returns an inconclusive result for all choices 
of priors, except for the case of the log prior and δ(g − 2)_μ based on e⁺e⁻ data alone, which 
instead shows weak evidence for incompatibility. The reason for this result can easily be 
understood by considering fig. 3, which shows that for almost all possible observed values 
of b → sγ and δ(g − 2)_μ the ℛ-test is undecided. Regions of large positive SUSY contributions to 
δ(g − 2)_μ and large negative corrections to b → sγ would be favoured (upper left corner 
of fig. 3), while regions of large positive corrections to b → sγ and large δ(g − 2)_μ values 
are disfavoured (upper right corner). The relative size of those regions is somewhat prior 
dependent. This comes about in an analogous fashion as for the toy model presented in the 
Appendix: the ℛ-test deems observed values to be compatible if they tend to come from 
"compensating" regions of parameter space. However, by comparing fig. 3 with fig. 2 it is 
apparent that in this context the ℒ-test is the more stringent of the two, while the ℛ-test 
remains quite lenient, at least given the current experimental errors on δ(g − 2)_μ and b → sγ. 
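
The interpretation column of table 6 follows from a simple threshold rule on ln ℛ. The sketch below uses the thresholds |ln ℛ| = 1.0 (weak) and 2.5 (moderate) quoted in the caption of fig. 3; the additional threshold of 5.0 for strong evidence and the function name are assumptions of this sketch, since the table defining the scale is not reproduced in this section.

```python
def jeffreys_label(ln_r, weak=1.0, moderate=2.5, strong=5.0):
    """Map ln R onto a verbal grade; the 'strong' threshold is an ASSUMED value."""
    strength = abs(ln_r)
    if strength < weak:
        return "Inconclusive evidence"
    # ln R > 0 supports compatibility, ln R < 0 supports incompatibility.
    direction = "compatibility" if ln_r > 0 else "incompatibility"
    if strength < moderate:
        grade = "Weak"
    elif strength < strong:
        grade = "Moderate"
    else:
        grade = "Strong"
    return f"{grade} evidence for {direction}"

# Central values of ln R quoted in table 6 for the e+e- based determination.
for prior, ln_r in [("Flat", -0.62), ("Log", -2.04), ("CCR mSUGRA", -0.69)]:
    print(f"{prior:11s} {ln_r:6.2f}  {jeffreys_label(ln_r)}")
```

Only the log-prior entry crosses the first threshold, which is why it is the single case labelled as weak evidence for incompatibility.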

[Figure 3 panels: ℛ-test with flat prior (left), log prior (middle) and CCR mSUGRA prior (right). Horizontal axis: BR(B → X_s γ) × 10⁴. In-plot labels: "incompatible", "undecided", "e⁺e⁻ data", "τ data".]

Figure 3: ℛ-test for both BR(B → X_s γ) and δ(g − 2)_μ, for the flat prior (left panel), log prior 
(middle panel) and CCR mSUGRA prior (right panel) in the CMSSM. The encircled crosses give 
the actual observed values (for the two different δ(g − 2)_μ determinations). Contours delimit values 
of |ln ℛ(δ(g − 2)_μ, b → sγ|D)| = 1.0, 2.5, corresponding to levels of weak and moderate strength 
of evidence for either hypothesis, respectively, according to the Jeffreys' scale. The region in the 
top left corner favours H_0 (that the two measurements are compatible), while the top right corner 
favours H_1 (incompatible measurements). The white region returns an undecided result. 

5. Conclusions 

We have subjected the question of the mutual compatibility of b → sγ and δ(g − 2)_μ to 
detailed scrutiny, employing two different statistical tests that look for possible inconsistencies 
between the two quantities and between the quantities and the model, in our case 
the CMSSM. We have found no sign of tension between b → sγ and the τ decay derived 
measurement of δ(g − 2)_μ under either test. On the other hand, our most stringent test 
shows a ~ 95% indication of tension between b → sγ, the e⁺e⁻-based value of δ(g − 2)_μ 
and the other observed data (including the WMAP 5-yr dark matter determination). This 
can be interpreted in two ways: either as a sign of undetected systematics in the e⁺e⁻ 
value of δ(g − 2)_μ, or (perhaps more interestingly) as an early indication of the difficulty 
of the CMSSM to simultaneously explain the observed values of b → sγ and δ(g − 2)_μ. If 
the ~ 3σ discrepancy in the anomalous magnetic moment of the muon is confirmed, this 
could be interpreted as evidence against the viability of the CMSSM. 

A. Illustration of the consistency tests on a toy problem 

In order to illustrate the use of the Bayesian evidence to quantify the consistency between 
different datasets as discussed in section 2, we apply the ℛ-test (defined in eq. (2.5)) 
and the ℒ-test (defined in eq. (2.7)) to the simple linear problem of fitting a straight line 
through a set of data points, and then check for the consistency of a new observation with 
the previous measurements and with the model. 

A.1 Toy problem 

We consider that the true underlying model for some process is a straight line described 
by 

    y(x) = m x + c,    (A.1) 

where the free parameters in the model are the slope m and the intercept c, whose true 
value is assumed to be 1 for both. The data consist of observations y_i at known locations 
x_i, with Gaussian noise of known variance σ², 

    y_i - y(x_i) = \epsilon \sim \mathcal{N}(0, \sigma^2).    (A.2) 

We split the full dataset d in two parts, d = {𝒟, D}, and we wish to test for the consistency 
of the subset 𝒟 with the assumed subset D and with the model of eq. (A.1). The likelihood 
function can then be written as 

    \mathcal{L}(m, c) = p(d|m, c) = \prod_i p(d_i|m, c),    (A.3) 

where 

    p(d_i|m, c) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp[-\chi_i^2/2]    (A.4) 

and 

    \chi_i^2 = \frac{[y(x_i) - y_i]^2}{\sigma^2},    (A.5) 

where y(x_i) is the predicted value of y at a given x_i as a function of c, m and y_i is the 
measured value. We impose uniform, U(−5, 5) priors on both m and c. 
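
The ingredients of the toy problem translate directly into code. The following is a minimal sketch, assuming NumPy is available; the function names, the random seed and the choice of 9 locations x = 0, ..., 8 (borrowed from the first test case of this appendix) are illustrative assumptions rather than part of the original analysis.

```python
import numpy as np

def line(x, m, c):
    """Straight-line model of eq. (A.1)."""
    return m * x + c

def log_likelihood(m, c, x, y, sigma):
    """Logarithm of eq. (A.3): sum over data points of the Gaussian terms of eq. (A.4)."""
    chi2 = ((line(x, m, c) - y) / sigma) ** 2
    return float(np.sum(-0.5 * chi2 - 0.5 * np.log(2.0 * np.pi * sigma**2)))

def log_prior(m, c, half_width=5.0):
    """Log of the uniform U(-5, 5) prior on both m and c."""
    inside = abs(m) <= half_width and abs(c) <= half_width
    return -2.0 * np.log(2.0 * half_width) if inside else -np.inf

# Simulated data: true slope and intercept equal to 1, Gaussian noise with sigma = 0.5.
rng = np.random.default_rng(0)          # the seed is arbitrary
x = np.arange(9.0)
y = line(x, 1.0, 1.0) + rng.normal(0.0, 0.5, size=x.size)

print("ln L at the true parameters:", log_likelihood(1.0, 1.0, x, y, 0.5))
print("ln prior at the true parameters:", log_prior(1.0, 1.0))
```

The unnormalised log-posterior is the sum of the two functions, which is all that a grid scan or a sampler needs in order to integrate out m and c.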

A.2 Consistency tests for one observable at a time 

In the first case, we wish to test for the consistency of one new observation with a set of 
previously gathered data points. We take the data set D to consist of 9 data points at 
x = {0, 1, 2, 3, 4, 5, 6, 7, 8}, while the data set 𝒟 we wish to test consists of one observation 
at x_9 = 9. For definiteness, we use σ = 0.5 for the noise. We now employ the ℒ-test and 
the ℛ-test to check for consistency between datasets D and 𝒟 = y_9 = y(x_9). We scan 
over the following y_9 values for x = 9: 

    𝒟 : y_9 = 7.5, ..., 12.5 in intervals of 0.5.    (A.6) 

Figure 4: Left panel: toy model illustration of the 1-dimensional ℒ-test for y_9, with the horizontal, 
dashed lines representing levels of 1, 2, 3σ significance. The vertical line is the value that corresponds 
to the true value of the model's parameters. Right panel: toy model illustration of the 1-dimensional 
ℛ-test. The horizontal lines delineate levels of evidence according to the Jeffreys' scale, in favour 
of compatibility (for ln ℛ > 0) or against it (for ln ℛ < 0). Both tests correctly identify the data 
region corresponding to the true model. 

The outcome of these two tests is shown in fig. 4, demonstrating that both tests 
correctly identify the region around y_9 ~ 10 as the one where the datasets are consistent 
(notice that although the two curves look very similar they are not identical). According to 
the ℛ-test (right panel), the consistency hypothesis begins to be disfavoured in the regions 
y_9 < 8.0 and y_9 > 11.5, which according to the ℒ-test corresponds to tension between 
the two datasets at the ~ 2σ level (compare the left panel of fig. 4). As is generically 
the case when comparing hypothesis testing using likelihood ratios and Bayesian model 
comparison, the significance levels of the former appear to give stronger results than the 
strength of evidence from the latter seems to justify. This is well known in the statistical 
literature, and a particular version of this phenomenon goes under the name of "Lindley's 
paradox". For further details about interpreting and comparing the two results, see [26, 30] 
and references therein. 
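
The one-dimensional scan can be approximated analytically. The sketch below is an illustration under several stated assumptions, not the computation used for fig. 4: (i) the ℒ-test is taken to be built from the predictive probability p(y_9|D) normalised to its peak and the ℛ-test from the ratio p(y_9|D)/p(y_9), since eqs. (2.5) and (2.7) are not reproduced in this appendix; (ii) the data D are taken to be noise-free, so that the predictive mean sits at y_9 = 10; (iii) the prior box is wide enough for the predictive distribution to be Gaussian; (iv) the prior predictive p(y_9) is approximated by its plateau value 1/90, which holds for |y_9| well below 40 when m and c are uniform on (−5, 5).

```python
import math

SIGMA = 0.5                                # noise level used in the text
X_OLD = [float(i) for i in range(9)]       # locations of the 9 points in D
X_NEW = 9.0                                # location of the tested point
Y_MEAN = 1.0 * X_NEW + 1.0                 # ASSUMPTION: noise-free D, mean on the true line

def predictive_std(x_new, xs, sigma):
    """Standard deviation of a new observation predicted by a least-squares line fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return sigma * math.sqrt(1.0 + 1.0 / n + (x_new - x_bar) ** 2 / s_xx)

S_PRED = predictive_std(X_NEW, X_OLD, SIGMA)

def ln_predictive_ratio(y9):
    """ln[p(y9|D) / max p(y9|D)] for the Gaussian predictive distribution."""
    return -0.5 * ((y9 - Y_MEAN) / S_PRED) ** 2

def ln_r(y9):
    """ln[p(y9|D) / p(y9)], with p(y9) set to its plateau value 1/90 (assumption iv)."""
    ln_p_given_d = ln_predictive_ratio(y9) - math.log(S_PRED * math.sqrt(2.0 * math.pi))
    ln_p_prior = -math.log(90.0)
    return ln_p_given_d - ln_p_prior

for k in range(11):                        # scan of eq. (A.6)
    y9 = 7.5 + 0.5 * k
    print(f"y9 = {y9:4.1f}   ln ratio = {ln_predictive_ratio(y9):6.2f}   ln R = {ln_r(y9):6.2f}")
```

In this noise-free configuration ln ℛ is positive around y_9 = 10 and turns negative roughly 1.8 units away on either side. A particular noise realisation in D shifts the centre of this window, so the sketch is not expected to reproduce the exact boundaries quoted above.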



A.3 Consistency tests for two observables jointly 

In order to perform the consistency tests for two new observations jointly, we generate 
8 data points at x = {0, 1, 2, 3, 4, 5, 6, 7}, again with Gaussian noise σ = 0.5. These data 
points are referred to as D. The dataset 𝒟 now consists of y(x = 8) = y_8 and y(x = 9) = y_9. 
We scan over the following values for the possible outcomes of the observation for 𝒟 
(assuming the same noise properties as D): 

    y_8 = 6.5, ..., 11.5 in intervals of 0.5, 
    y_9 = 7.5, ..., 12.5 in intervals of 0.5.    (A.7) 

The results of applying the ℒ-test and the ℛ-test are shown in fig. 5. It can be seen clearly 
from fig. 5 that both tests favour the consistency hypothesis around the region with 
y_8 ~ 9 and y_9 ~ 10, which corresponds to the outcome for the true value of the parameters 
(marked by an encircled black cross). 



[Figure 5 panels: ℒ-test (left), ℛ-test (right).]

Figure 5: Left panel: toy model illustration of the 2-dimensional ℒ-test, showing the true value 
(black, encircled cross), the maximum probability prediction (green diamond) and 1, 2, 3σ contours 
around it. Right panel: toy model illustration of the 2-dimensional ℛ-test. The black cross 
is the true value, while the black diamond is the point of maximum evidence in favour of the 
hypothesis of compatibility. Contours delineate regions of weak evidence in favour of compatibility 
(innermost/green region), inconclusive result (white), and regions of increasing evidence against 
compatibility (shades of red: weak, moderate and strong evidence according to the Jeffreys' scale). 



It can also be seen from fig. 5 that, although there is an overlap between the 
consistent and inconsistent regions favoured by the two tests, they generally prefer different 
regions, as they look for inconsistency between datasets in a different manner, as discussed 
in section 2. For this particular model of line fitting, the consistent region according to 
the ℛ-test is the one where the data points y_8 and y_9 lie on different sides of the true line 
given by eq. (A.1), i.e. the test favours either y_8 > 9 and y_9 < 10 or y_8 < 9 and y_9 > 10. 
The consistent region according to the ℒ-test is the one where both data points y_8 and y_9 
lie on the same side of the true line, i.e. either y_8 > 9 and y_9 > 10 or y_8 < 9 and y_9 < 10. 
This difference between the two consistency tests can be understood by considering that 
the ℛ-test is trying to determine the probability that the datasets 𝒟 and D all come from 
the same model, and so in order to enforce compatibility between them it favours the data 
points y_8 and y_9 to lie on different sides of the true line, thus preferring anti-correlated 
values. The ℒ-test, on the other hand, is trying to fit a straight line model for the given 
data values, and so if y_8 is higher, it favours y_9 to be higher as well and vice versa, therefore 
favouring correlated behaviour. 
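
The different preferences of the two tests can be illustrated with a brute-force grid integration over m and c. The sketch below, which assumes NumPy is available, is an illustration and not the computation behind fig. 5. It assumes that the ℒ-test is built from the joint predictive probability p(y_8, y_9|D) normalised to its peak and the ℛ-test from the ratio p(y_8, y_9|D)/p(y_8, y_9) (eqs. (2.5) and (2.7) are not reproduced in this appendix), and it takes the data D to be noise-free so that the result is deterministic. The grid resolution, the function names and the two test outcomes are choices of this sketch.

```python
import numpy as np

SIGMA = 0.5
X_OLD = np.arange(8.0)                 # locations of the 8 points in D
Y_OLD = 1.0 * X_OLD + 1.0              # ASSUMPTION: noise-free D on the true line
X_NEW = np.array([8.0, 9.0])           # locations of the two tested points

# Regular grid over the uniform U(-5, 5) priors on slope m and intercept c.
M, C = np.meshgrid(np.linspace(-5.0, 5.0, 801),
                   np.linspace(-5.0, 5.0, 801), indexing="ij")

def grid_likelihood(xs, ys):
    """Gaussian likelihood of the points (xs, ys) evaluated on the whole (m, c) grid."""
    log_like = np.zeros_like(M)
    for x, y in zip(xs, ys):
        log_like += -0.5 * ((y - (M * x + C)) / SIGMA) ** 2
        log_like += -0.5 * np.log(2.0 * np.pi * SIGMA**2)
    return np.exp(log_like)

POSTERIOR = grid_likelihood(X_OLD, Y_OLD)
POSTERIOR /= POSTERIOR.sum()           # posterior weights p(m, c | D) on the grid

def predictive(y8, y9):
    """Return p(y8, y9 | D) and the prior predictive p(y8, y9)."""
    like_new = grid_likelihood(X_NEW, [y8, y9])
    return np.sum(POSTERIOR * like_new), np.mean(like_new)

P_PEAK, _ = predictive(9.0, 10.0)      # predictive at the true-line outcome (its peak here)

def ln_tests(y8, y9):
    """ln of the peak-normalised predictive, and ln R = ln[p(.|D) / p(.)]."""
    p_post, p_prior = predictive(y8, y9)
    return np.log(p_post / P_PEAK), np.log(p_post / p_prior)

for label, point in [("same side", (9.5, 10.5)), ("opposite sides", (9.5, 9.5))]:
    ln_l, ln_r = ln_tests(*point)
    print(f"{label:15s} ln ratio = {ln_l:6.2f}   ln R = {ln_r:6.2f}")
```

For this noise-free configuration the peak-normalised predictive is larger for the same-side outcome, while ln ℛ is larger for the opposite-sides outcome, in line with the correlated versus anti-correlated preferences described above.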